Prediction in Baseball

Jim Albert, Emeritus Professor, BGSU

2025-06-01

Introduction

Using Prediction to Make Sense of Baseball Patterns

  • Talk about two issues in baseball, streakiness and growth in home run hitting

  • Propose some models for baseball outcomes and home run hitting

  • Simulate predictions based on fitted models

  • Compare predictions with observed

Exploring Streaky Patterns

Looking for True Streakiness

  • Collect individual plate appearance data for all players in an MLB season
  • Focus on patterns of hot and cold hitting
  • Is there evidence that players are truly streaky?
  • Maybe we are observing coin-flipping behavior?

Binary Sequence

  • Observe sequence of hitting data for a player
  • For each plate appearance, observe a “success” (1) or “failure” (0)
  • Focus on pattern of streaks and slumps
  • Look at spacings, the number of “failures” between consecutive “successes”

Different Definitions of Success

  • “success” = HIT
  • “success” = HOME RUN
  • “success” = STRIKEOUT

Example - Mike Schmidt Home Run Spacings

  • In 1980 season, Schmidt hit 48 home runs on these PAs:

25 32 41 45 72 76 86 87 100 131 141 150 160 162 176 178 182 187 221 228 269 301 316 339 342 343 368 406 414 420 425 433 454 455 473 522 540 554 578 588 596 598 604 616 637 640 645 652

  • Spacings are 25, 7, 9, 4, 27, …
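The spacings can be recovered from the PA numbers by differencing. The sketch below is illustrative only: the function name `spacings` is introduced here, and it follows the values on the slide, so each spacing counts the PAs up to and including the home run (one more than the number of failures in between).

```python
# PA numbers of Schmidt's 48 home runs in 1980, copied from the slide.
hr_pa = [25, 32, 41, 45, 72, 76, 86, 87, 100, 131, 141, 150, 160, 162,
         176, 178, 182, 187, 221, 228, 269, 301, 316, 339, 342, 343,
         368, 406, 414, 420, 425, 433, 454, 455, 473, 522, 540, 554,
         578, 588, 596, 598, 604, 616, 637, 640, 645, 652]


def spacings(success_pa):
    """Gap from each success to the next; the first gap counts from PA 0."""
    previous = [0] + success_pa[:-1]
    return [cur - prev for cur, prev in zip(success_pa, previous)]


y = spacings(hr_pa)
print(y[:5])   # [25, 7, 9, 4, 27]
print(len(y))  # 48
```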

Need a Streaky Measure

  • Consider two probability models - Consistent and Streaky
  • Construct a Bayesian measure to distinguish between the models

Geometric Model

  • Let \(y_1, ..., y_n\) denote the observed spacings.
  • Assume \(y_j\) are independent \[ y_j \sim Geometric (p_j) \]
  • Put models on the probabilities \(p_1, ..., p_n\)

Two Models - Consistent and Streaky

  • Model \(C\): Hitter is truly consistent

\[p_1 = ... = p_n = p\]

  • Model \(S\): Hitter is truly streaky

the \(p_j\) are different and distributed according to a Beta(\(a, b\)) curve, for specified values of \(a\) and \(b\)

Bayes Factor

  • Bayes factor in support of streaky \(S\) is \[ BF = \frac{f(y | S)}{f(y | C)} \]

  • If \(\log BF > 0\), support for true streakiness.

  • Here we call a player streaky if \(\log BF > 0.5\)
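A minimal sketch of how such a Bayes factor could be computed, under assumptions that are not in the slides: spacings take values 1, 2, ... with \(P(y_j = k) = p_j (1 - p_j)^{k-1}\), and the consistent model places the same Beta(\(a, b\)) prior on its single \(p\). The prior actually used for Model \(C\) may differ. With these choices both marginal likelihoods are ratios of Beta functions.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(y, a, b):
    """log BF of streaky (S) against consistent (C) for spacings y >= 1."""
    n = len(y)
    failures = sum(y) - n
    # Model S: each spacing has its own p_j ~ Beta(a, b), integrated out.
    log_s = sum(log_beta(a + 1, b + yj - 1) - log_beta(a, b) for yj in y)
    # Model C (assumed prior): one shared p ~ Beta(a, b), integrated out.
    log_c = log_beta(a + n, b + failures) - log_beta(a, b)
    return log_s - log_c


# Two toy sequences with the same total length, uniform prior (a = b = 1).
print(log_bayes_factor([5, 5], 1, 1))  # even spacings: negative, favors C
print(log_bayes_factor([1, 9], 1, 1))  # uneven spacings: positive, favors S
```

With a single spacing the two models coincide and the log Bayes factor is 0, which is a quick sanity check on the formulas.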

Prediction Exercise

  • Assume Consistent Model where each player has a single probability of success.

  • Estimate probabilities \(P_1, ..., P_N\) for \(N\) players using an exchangeable model.

  • Simulate binary outcomes from Bernoulli distributions using these probability estimates.

  • Using Bayes factor, find the fraction of streaky hitters.

Predictive Simulation

  • Set definition of success (HIT or SO)

  • Simulate 50 replicated datasets from the predictive distribution of the fitted consistent model

  • For each season, plot the fraction of streaky hitters

  • Compare with observed fraction
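The simulation loop can be sketched as follows. Everything numeric here is hypothetical: the player probabilities stand in for the exchangeable-model estimates, 500 PAs per player and the Beta(2, 6) prior are arbitrary, and the Bayes factor uses an assumed shared Beta(\(a, b\)) prior for the consistent model. PAs after a player's last success are dropped.

```python
import random
from math import lgamma


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(y, a, b):
    """log BF of streaky vs consistent (assumed Beta(a, b) prior in both)."""
    n, failures = len(y), sum(y) - len(y)
    log_s = sum(log_beta(a + 1, b + yj - 1) - log_beta(a, b) for yj in y)
    log_c = log_beta(a + n, b + failures) - log_beta(a, b)
    return log_s - log_c


def simulate_spacings(p, n_pa, rng):
    """Draw n_pa Bernoulli(p) outcomes and return the spacings between successes."""
    result, gap = [], 0
    for _ in range(n_pa):
        gap += 1
        if rng.random() < p:
            result.append(gap)
            gap = 0
    return result


def fraction_streaky(probs, n_pa, a, b, rng, cutoff=0.5):
    """Fraction of simulated players whose log BF exceeds the cutoff."""
    flags = []
    for p in probs:
        y = simulate_spacings(p, n_pa, rng)
        if len(y) >= 2:  # need at least two spacings to compare models
            flags.append(log_bayes_factor(y, a, b) > cutoff)
    return sum(flags) / len(flags) if flags else 0.0


rng = random.Random(2025)
probs = [0.20 + 0.002 * i for i in range(50)]  # hypothetical estimates
fractions = [fraction_streaky(probs, 500, 2, 6, rng) for _ in range(50)]
print(min(fractions), max(fractions))
```

The 50 values in `fractions` play the role of the predictive distribution against which an observed fraction of streaky hitters would be compared.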

HIT - Streakiness in 50 Simulations

HIT - Compare with Observed

HIT as Success

  • Observed fraction of streaky hitters is similar to what one would predict from consistent model.

  • Hard to identify truly streaky hitters using Hit as success.

SO - Streakiness in 50 Simulations

SO - Compare with Observed

SO as Success

  • Find more observed streaky players than one would predict based on simulations from a consistent model.

  • So patterns of Strikeout streakiness are “interesting”.

  • Motivates search for hitters who have streaky patterns over their careers.

Understanding Surge in Home Run Hitting

History of Home Run Rates of Balls in Play

What is Causing the Rise in Home Run Rates?

  • In the fall of 2017, Major League Baseball charged a committee with identifying the potential causes of the increase in the home run rate from 2015 to 2017.

  • Committee released two reports (May 2018 and December 2019)

Possible Reasons for Increase in HRs

The batters?

  • Changes in characteristics of batted balls
  • Launch angle, exit velocity, and spray angle

The ball?

  • Changes in how the ball is made?
  • Seam height, core?
  • Drag coefficient (resistance of ball as it travels)?

Modeling Approach

  • Focus on the in-play home run rates in July 2021 and July 2022

  • Observe big drop in the HR rate

  • Is it due to the hitter’s approach?

  • Or is it due to the ball?

Graph

Generalized Additive Model

  • Express the logit of the home run probability as \[ \log \left(\frac{P(HR)}{1 - P(HR)}\right) = s(LA, LS) \]

  • \(s()\) is a smooth function of the launch angle (LA) and the launch speed (LS)

  • Generalization of the linear regression model \(y = X \beta + \epsilon\)

Approach

  • Fit a GAM model to the in-play HR data for July 2021

  • Use the model fit to predict the HR rate for July 2022 using the 2022 launch variables

  • By simulation, get a prediction distribution
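One way to write the prediction step, as a sketch with notation introduced here for illustration: let \(\hat{s}\) be the smooth fitted to the July 2021 data and let \((LA_i, LS_i)\), \(i = 1, ..., n\), be the launch variables of the \(n\) balls in play in July 2022. The predicted home run probability for each ball and the predicted 2022 rate are then

\[ \hat{p}_i = \frac{\exp\left(\hat{s}(LA_i, LS_i)\right)}{1 + \exp\left(\hat{s}(LA_i, LS_i)\right)}, \qquad \hat{r}_{2022} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i \]

A prediction distribution for the rate could be obtained by repeatedly simulating Bernoulli(\(\hat{p}_i\)) outcomes and recomputing the rate. Whether the simulation also accounts for uncertainty in \(\hat{s}\) is not stated in the slides.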

Graph

Observations

  • Predictions are smaller than the observed 2021 rate. This indicates a change in the hitter launch variables.

  • But the observed 2022 rate falls below the prediction distribution – this indicates that the ball is deader in 2022

Aaron Judge

  • Slugger currently playing for Yankees
  • Broke American League HR record with 62 in 2022
  • Currently has hit 330 HR in career

Aaron Judge in 2022

  • Hit 62 home runs during a season when the ball was relatively dead

  • Raises the question: How many home runs would Judge hit in a different season of the Statcast era?

Methodology

  • Suppose the different season is 2019.

  • Fit a “2019 ball model” that predicts the probability of a HR in 2019 given values of the launch angle and exit velocity.

  • Collect the launch variables for Judge for all balls put into play. For each BIP, predict P(HR) using 2019 ball model.

  • Sum the probabilities – predict the season HR.

Predict

  • For each of Judge’s balls in play in 2022, predict the probability of HR from the launch variables using the 2019 ball model.

  • Sum the probabilities – predict total HR count

  • Can get a 90% prediction interval
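The summing step and an interval can be sketched as below. The probabilities are made up for illustration and are not Judge's data or output from any fitted model. The interval here reflects only the Bernoulli variation in outcomes given the probabilities; the interval reported in the talk may also include model uncertainty.

```python
import random


def predict_hr_total(hr_probs, n_sims=2000, seed=1):
    """Expected HR count and a simulated 90% interval for the total."""
    rng = random.Random(seed)
    expected = sum(hr_probs)
    # Each simulated season: one Bernoulli draw per ball in play.
    totals = sorted(
        sum(rng.random() < p for p in hr_probs) for _ in range(n_sims)
    )
    lo = totals[int(0.05 * n_sims)]       # about the 5th percentile
    hi = totals[int(0.95 * n_sims) - 1]   # about the 95th percentile
    return expected, (lo, hi)


# Hypothetical P(HR) values for 400 balls in play.
hr_probs = [0.9] * 40 + [0.3] * 60 + [0.02] * 300
expected, (lo, hi) = predict_hr_total(hr_probs)
print(round(expected, 1), lo, hi)
```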

Results

  • If Judge was hitting using a 2019 ball, predict he would hit 75 home runs

  • A 90% prediction interval would be (69, 81)

Repeat this method for other Statcast seasons

Takeaway

  • Judge only hit 62 home runs in 2022

  • But if he were playing during a different season where the ball was more alive (more carry), his predicted 2022 count would be in the 70s

  • So Judge’s home run achievement is understated

  • Due to this ball bias, we don’t appreciate the magnitude of Judge’s accomplishment

References

  • 2007 Streaky Hitting in Baseball, Journal of Quantitative Analysis of Sports, Vol 4, Issue 1.

  • 2013 Looking at Spacings to Assess Streakiness, Journal of Quantitative Analysis of Sports, Vol 9, Issue 2.

  • 2014 Streakiness in Home Run Hitting. Chance, 27(3), 4-9.

  • 2020 The Home Run Explosion, Science Meets Sport, Cambridge Scholars Publishing.

  • 2024 Balls are Traveling Farther in 2024 in Progressive Field (with Alan Nathan), Baseball Prospectus